Discovering Semantic Patterns in Bibliographically Coupled Documents

Author

  • Jian Qin
Abstract

Issues in discovering knowledge in bibliographic databases are addressed. An example of semantic pattern analysis is used to demonstrate the methodological aspects of knowledge discovery in bibliographic databases. The semantic pattern analysis is based on the keywords selected from the documents grouped by bibliographical coupling. The frequency distribution patterns suggest the existence of a common intellectual base with a wide range of specialties and marginal areas in the antibiotic resistance literature. The resulting values for keyword density per rank show a tenfold difference between the specialty and marginal keyword densities. The possibilities for, and further studies of, incorporating knowledge discovery results into information retrieval are discussed.

Jian Qin, School of Information Studies, Syracuse University, 4206 Center for Science and Technology, Syracuse, NY 13244
LIBRARY TRENDS, Vol. 48, No. 1, Summer 1999, pp. 109-132
© 1999 The Board of Trustees, University of Illinois

INTRODUCTION

Knowledge discovery in databases (KDD) is considered a process of nontrivial extraction of implicit, previously unknown, and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases (Chen, Han, & Yu, 1996, p. 866). Most research on KDD has focused on applications in business operations and well-structured data. Knowledge discovery in textual databases has been underemphasized (Trybula, 1997). Among the limited publications on KD in textual databases, full-text document data are the primary source of analysis. Lent, Agrawal, and Srikant (1997) developed a patent mining system at IBM for identifying trends in large textual databases over a period of time. They used sequential pattern mining to identify recurring phrases and generate histories of phrases, after which they extracted phrases that satisfied a specific trend.
Discovering associations among the keywords in texts is another area of research in KD in textual databases. Using background knowledge about the relationships of keywords, Feldman and Hirsh (1996) studied associations among the keywords or concepts representing the documents. The knowledge base they built supplies unary or binary relations among the keywords representing the documents. Feldman, Dagan, and Hirsh (1998) developed a system for Knowledge Discovery in Text (KDT) that extracts keywords to represent document contents and allows users to browse a list of keywords that co-occur with one or more other keywords for knowledge discovery purposes. Mining in full-text documents attempts to extract useful associations and patterns for representing the document content, including clustering, categorization, summarization, and feature extraction.

While many studies using data from bibliographic databases were not conducted in terms of KDD or data mining, they nevertheless bear the marks of KDD's techniques and analysis. Such examples can be found in citation and cocitation analysis (Kassler, 1965; Small, 1973; Small & Sweeney, 1985; Braam, Moed, & van Raan, 1991), keyword classifications (Sparck Jones & Jackson, 1970), investigation of indexing similarities between keywords and controlled vocabularies (Shaw, 1990; Qin, in press), and author mapping (Logan & Shaw, 1987).

Discovering knowledge through mining textual data in bibliographic databases presents more problems than mining numerical data. One problem is that most fields in a bibliographic database hold long character strings, e.g., author name, title, affiliation, journal title, and indexing terms (from both keywords and controlled vocabularies). Such long strings are usually difficult for statistical packages or data mining software to process computationally. Unlike the full-text document source, bibliographic data are semi-structured.
Although this semi-structure may be an advantage over completely unstructured full-text documents, it also creates a challenge for mining tools: the data in the structured fields should not be mixed up when data sets are extracted and analyzed. Linguistic problems (such as singulars and plurals, stems and suffixes) and inconsistencies in abbreviating journal titles and institution names can also be challenging issues in mining bibliographic data. To obtain valid and reliable data for discovering trends and patterns in subject fields and research, data preprocessing and cleansing can become very time-consuming and both labor-intensive and intellectually intensive. However, the most challenging issue remains whether there is a chance for information retrieval systems to “be extended to become knowledge discovery systems,” or whether “the kinds of record existing in bibliographical and textual databases offer any possibility of analysis in ways similar to those in more structured factual databases” (Vickery, 1997, pp. 119-20).

This study selected a set of bibliographic records as the data source for discovering semantic patterns among the keywords in these records. The purpose of this keyword analysis was to discover whether any semantic patterns existed in the keywords extracted from bibliographically coupled documents regarding antibiotic resistance in pneumonia. If such patterns did exist, a further question was how the discovered knowledge about a subject field could be used to improve the effectiveness of knowledge representation and information retrieval. A preliminary test of the antibiotic resistance in pneumonia literature found that documents citing the same publication not only co-cited other publications but also contained semantically similar or identical keywords in the titles of cited publications.
The frequency distributions of these keywords showed three distinctive strata: a very small number of keywords falling into the highest frequency region, a relatively larger group with moderate occurrences, and a majority appearing only once or twice. If the terms occurring most frequently represent the intellectual base in this subject area (Small, 1973; Small & Sweeney, 1985) and the ones with medium occurrences represent the specialties, then the terms occurring least frequently represent the marginal terms. These marginal terms may be the links between the mainstream of antibiotic resistance research and the less overt but promising research. The citation-semantic analysis is aimed at discovering semantic patterns of the antibiotic resistance literature so that the analysis process and semantic patterns can be programmed into tools that can assist information searchers in building search queries and customizing their postsearch analysis. Specifically, this project studied whether the distribution follows the three strata described earlier, how such a distribution can be measured, and to what extent the keywords in these strata reflect the research front in antibiotic resistance. The methods used to preprocess and analyze the data are discussed in detail in the following sections.
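
The three-strata idea described above can be illustrated with a minimal sketch. It is not the method used in the study: the function name `keyword_strata`, the cutoffs `high_min` and `low_max`, and the sample keywords are all hypothetical, chosen only to show how keywords might be counted and sorted into an intellectual base, specialties, and marginal terms.

```python
from collections import Counter


def keyword_strata(keywords, high_min=10, low_max=2):
    """Split keywords into three frequency strata.

    high_min and low_max are hypothetical cutoffs for illustration only;
    the source text does not state which thresholds were used.
    """
    # Light normalization (trim and lowercase) so trivial variants are merged.
    counts = Counter(k.strip().lower() for k in keywords if k.strip())
    strata = {"base": {}, "specialty": {}, "marginal": {}}
    for term, n in counts.items():
        if n >= high_min:
            strata["base"][term] = n        # highest-frequency region
        elif n > low_max:
            strata["specialty"][term] = n   # moderate occurrences
        else:
            strata["marginal"][term] = n    # appears only once or twice
    return strata


# Invented sample data, not taken from the study.
sample = (
    ["resistance"] * 12
    + ["penicillin"] * 5
    + ["pneumonia"] * 4
    + ["efflux", "biofilm", "biofilm"]
)
strata = keyword_strata(sample)
for name, terms in strata.items():
    print(name, sorted(terms))
```

With real data the cutoffs would need to be derived from the observed distribution rather than fixed in advance, and the normalization step would also have to handle the singular/plural and stemming problems mentioned earlier.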

Similar resources

Discovering Hidden Analogies in an Online Humanities Database

Voluminous databases contain hidden knowledge, i.e., literatures that are logically but not bibliographically linked. Unlinked literatures containing academically interesting commonalities cannot be retrieved via normal searching methods. Extracting hidden knowledge from humanities databases is especially problematic because the literature, written in “everyday” rather than technical language, lac...

Automatically Discovering Semantic Links among Documents and Applications

ABSTRACT Automatically discovering semantic links among documents is the basis of developing advanced applications on large-scale documentary resources. This paper proposes an approach to automatically discover semantic links in a given document set. It has the following advantages: (1) It does not rely on any predefined ontology. (2) The semantic link networks and relevant rules automatically ...

Relationships at the Heart of Semantic Web: Modeling, Discovering, and Exploiting Complex Semantic Relationships

The primary goal of today’s search and browsing techniques is to find relevant documents. As the current web evolves into the next generation termed the Semantic Web, the emphasis will shift from finding documents to finding facts, actionable information, and insights. Improving ability to extract facts, mainly in the form of entities, embedded within documents leads to the fundamental challeng...

Automatically constructing semantic link network on documents

Knowing semantic links among resources is the basis of realizing machine intelligence over large-scale resources. Discovering semantic links among resources with limited human interference is a challenge issue. This paper proposes an approach to automatically discovering and predicting semantic links in a document set based on a model of document semantic link network (SLN). The approach has th...

A semantic partition based text mining model for document classification

Feature Extraction is a mechanism used to extract key phrases from any given text documents. This extraction can be weighted, ranked or semantic based. Weighted and Ranking based feature extraction normally assigns scores to extracted words based on various heuristics. Highest scoring words are seen as important. Semantic based extractions normally try to understand word meanings, and words wit...

Journal title:
  • Library Trends

Volume 48  Issue 

Pages  -

Publication date: 1999